Here we present two datasets where curves do not have the same duration.
Let us investigate two common approaches to time normalisation prior to GAMM modelling.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ Category + s(t1, by = Category)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1850974 0.0006736 274.77 <2e-16 ***
## CategoryPEAK 0.0145556 0.0009525 15.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(t1):CategoryNO_PEAK 8.917 8.998 4563 <2e-16 ***
## s(t1):CategoryPEAK 8.933 8.999 3116 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.778 Deviance explained = 77.8%
## fREML = -25458 Scale est. = 0.0045003 n = 19900
Notice that:
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ Category + s(t_lin, by = Category)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.179668 0.000227 791.50 <2e-16 ***
## CategoryPEAK 0.024315 0.000321 75.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(t_lin):CategoryNO_PEAK 8.979 9 47803 <2e-16 ***
## s(t_lin):CategoryPEAK 8.998 9 36802 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.975 Deviance explained = 97.5%
## fREML = -47050 Scale est. = 0.00051256 n = 19900
In this case the result is excellent (var explained 97%). In fact, the data were generated by randomly applying a random linear time distorsion to each curve.
Suppose the data look like this instead:
Let us apply the same approaches as in dataset 1.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ Category + s(time, by = Category)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2247908 0.0002404 934.88 <2e-16 ***
## CategoryPEAK 0.0190229 0.0003400 55.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time):CategoryNO_PEAK 8.966 9 34965 <2e-16 ***
## s(time):CategoryPEAK 8.998 9 24985 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.972 Deviance explained = 97.2%
## fREML = -38123 Scale est. = 0.00045457 n = 15733
This is the perfect solution in this case. In fact, the data were generated by chopping the curves randmoly towards the end.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ Category + s(t_lin, by = Category)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2251063 0.0005743 392.0 <2e-16 ***
## CategoryPEAK 0.0182736 0.0008121 22.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(t_lin):CategoryNO_PEAK 8.746 8.981 5480 <2e-16 ***
## s(t_lin):CategoryPEAK 8.834 8.992 3602 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.839 Deviance explained = 83.9%
## fREML = -24456 Scale est. = 0.0025932 n = 15733
This time linear time normalisation creates distortions and artifacts in the solution.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(time)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2119276 0.0007156 296.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.92 8.998 3364 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.753 Deviance explained = 75.3%
## fREML = -12112 Scale est. = 0.0050952 n = 9950
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ s(t_lin)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2119276 0.0004548 465.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(t_lin) 8.992 9 9956 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.9 Deviance explained = 90%
## fREML = -16610 Scale est. = 0.0020585 n = 9950
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ te(t_lin, duration, k = 10)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2119276 0.0004486 472.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(t_lin,duration) 71.79 82 1127 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.903 Deviance explained = 90.3%
## fREML = -16640 Scale est. = 0.0020025 n = 9950
Notice that:
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ Category + s(time, by = Category)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.11650 0.01309 8.902 <2e-16 ***
## CategoryPEAK 0.19460 0.01310 14.856 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time):CategoryNO_PEAK 7.151 7.376 6392 <2e-16 ***
## s(time):CategoryPEAK 8.975 9.000 10029 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.891 Deviance explained = 89.2%
## fREML = -30138 Scale est. = 0.0028116 n = 19900
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ Category + s(t_lin, by = Category)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1415134 0.0003388 417.6 <2e-16 ***
## CategoryPEAK 0.1216671 0.0004792 253.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(t_lin):CategoryNO_PEAK 8.927 8.998 13318 <2e-16 ***
## s(t_lin):CategoryPEAK 8.997 9.000 27498 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.956 Deviance explained = 95.6%
## fREML = -39085 Scale est. = 0.0011421 n = 19900
##
## Family: gaussian
## Link function: identity
##
## Formula:
## y ~ Category + te(t_lin, duration, by = Category, k = 10)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13807 0.03382 4.082 4.48e-05 ***
## CategoryPEAK 0.24412 0.03936 6.202 5.67e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(t_lin,duration):CategoryNO_PEAK 50.69 56.23 2131 <2e-16 ***
## te(t_lin,duration):CategoryPEAK 56.19 58.48 4218 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.956 Deviance explained = 95.6%
## fREML = -38897 Scale est. = 0.0011445 n = 19900
Notice that:
This is to be expected because there are no samples of long NO_PEAK curves and of shot PEAK curves.
Besides that, using duration as covariate in a categorical setting is conceptually wrong, as duration is part of how the curve looks, just like its shape, and the purpose of the model should be to predict how the curve looks given its category. Duration is not part of the experimental design, so it should be a response variable, not a predictor.
An approach based on multidimensional functional PCA allows to jointly analyse curves together with their time distortion representation. This is more general than applying linear time normalisation and it is based on landmark registration.
See this
paper, appendix A in this
paper, this
paper and material available in this repo (under notes/
and projects/).